Identifying  Predictive  Structures  in  Relational  Data 
Using  Multiple  Instance  Learning 


Amy  McGovern  amy@CS.umass.edu 

David  Jensen  jensen@CS.umass.edu 

Knowledge  Discovery  Laboratory,  Computer  Science  Dept.,  Univ.  of  Massachusetts  Amherst,  Amherst,  MA  01003,  USA 


Abstract 

This paper introduces an approach for identifying predictive structures in relational data using the multiple-instance framework. By a predictive structure, we mean a structure that can explain a given labeling of the data and can predict labels of unseen data. Multiple-instance learning has previously only been applied to flat, or propositional, data, and we present a modification to the framework that allows multiple-instance techniques to be used on relational data. We present experimental results using a relational modification of the diverse density method (Maron, 1998; Maron & Lozano-Perez, 1998) and of a method based on the chi-squared statistic (McGovern & Jensen, 2003). We demonstrate that multiple-instance learning can be used to identify predictive structures on both a small illustrative data set and the Internet Movie Database. We compare the classification results to a k-nearest neighbor approach.


1.  Introduction 

Identifying useful structures in large relational databases is a difficult task. For example, consider the task of predicting which movies will be nominated for academy awards every year. The Internet Movie Database (IMDb) contains about one hundred movies that were nominated for academy awards in the time period 1970 to 2000 and thousands of movies that were not nominated in this time period. We would like to identify relational structure from a set of positive and negative examples (e.g., the structure surrounding nominated and non-nominated movies) that can explain known labels and predict labels for unseen data. Specifically, given the schema for the IMDb shown in Figure 1, we would like to identify some substructure that can predict which movies will be nominated and which movies will not be nominated. An example substructure could be a movie where one of the actors was previously nominated for an academy award. Such structures are useful not only for classification and prediction tasks but also for better understanding of large relational databases.

Figure 1. Schema that we used for the IMDb

Multiple-instance learning (MIL) (details of MIL are given in Section 2) is a promising framework for identifying predictive structures in large relational databases. First, MIL methods are designed for learning from ambiguous and partially labeled data. With relational data, it is often easy to label a collection of objects and their relations. However, labeling each individual object and relation by its contribution to the overall situation is more difficult. For example, we can obtain the labels for the movie subgraphs by noting whether the movie was nominated for an academy award, but it would be difficult to label each actor and studio by their individual contribution to whether the movie was nominated for an award. Second, multiple-instance (MI) techniques are designed to identify which part of the data can explain the labels. For example, the relations in the movies example could contain all related movies, releases, studios, etc., for each nominated movie, but the best concept might only use the studio and producers linked via a movie.

MIL has been used successfully in a number of applications using propositional data (Amar et al., 2001; Dietterich et al., 1997; Goldman et al., 2002; Maron, 1998; Maron & Ratan, 1998; Zhang & Goldman, 2002; Zucker & Chevaleyre, 2000). However, none of these studies has examined MIL approaches for relational data, even though the data set used in the introductory MIL paper (Dietterich et al., 1997) was relational (it was flattened into feature vectors to solve the task). By working with the data in relational form, we can detect structures that cannot be represented in a feature vector format. For example, a link between two related movies where the movie's producer was also nominated for an academy award for a previous movie would be very difficult to represent in propositional data, especially if the form of the final structure is not known in advance. Simply flattening the relational data (into homogeneous feature vectors) presents a number of problems. The homogeneity of the resulting data means either data duplication (which will affect probability estimation) or data loss through aggregation.

2.  Notation  and  background 

We use the PROXIMITY¹ system to represent, store, and query relational data sets. Let $G = (v, e)$ be a graph. Objects in the world, such as people, places, things, and events, are represented as vertices in the graph. Relations between these objects, such as acted-in(actor, movie), are represented by edges. In general, if there is a relation $r(o_1, o_2)$, then $o_1, o_2 \in v$ and $r \in e$. In PROXIMITY, vertices are called objects and relations are called links. Both objects and links can have multiple attributes associated with them. For example, using the schema shown in Figure 1, movies, people, genres, etc., are all objects. Relationships such as awarded(movie, best-picture) are links. Attributes can be associated with objects, such as movie.name, or links, such as awarded.award-status. PROXIMITY allows
us  to  query  the  database  using  a  graphical  query  language 
called  qGraph  (Blau  et  al.,  2002).  qGraph  provides  a  form 
of  abstraction  on  top  of  SQL  by  allowing  us  to  construct 
visual  queries  of  the  graphical  database.  Queries  return  a 
collection  of  subgraphs  and  not  just  a  set  of  database  rows. 
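As a rough illustration of this representation (it is not PROXIMITY's data model or API), objects and links with attributes could be held in plain Python structures. The class name, field names, and example identifiers below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Toy attributed graph: objects are vertices, links are directed edges."""
    # object id -> attribute dict, e.g. {"type": "movie", "name": "..."}
    objects: dict = field(default_factory=dict)
    # (source id, target id, relation name) -> attribute dict
    links: dict = field(default_factory=dict)

    def add_object(self, oid, **attrs):
        self.objects[oid] = attrs

    def add_link(self, src, dst, relation, **attrs):
        # A relation r(o1, o2) requires both endpoints to be objects in the graph.
        if src not in self.objects or dst not in self.objects:
            raise KeyError("both endpoints must be added before linking them")
        self.links[(src, dst, relation)] = attrs

    def size(self):
        # Graph size counted as number of vertices plus number of edges.
        return len(self.objects) + len(self.links)


g = AttributedGraph()
g.add_object("m1", type="movie", name="Example Movie")
g.add_object("p1", type="person", name="Example Actor")
g.add_link("p1", "m1", "acted-in")
```

Keying links by (source, target, relation) keeps the sketch small; a real store would also need to allow several links of the same relation between the same pair of objects.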

An  MI  learner  uses  labeled  bags  where  a  bag  is  a  collection 
of instances with one label for the entire collection. A positive bag contains at least one instance of the target concept
while  a  negative  bag  contains  none.  With  flat  data,  both 
instances  and  target  concepts  are  points  in  feature  space. 
With  relational  data,  both  instances  and  target  concepts  are 
graphs,  or  relations  among  a  (heterogeneous)  set  of  objects. 
The  goal  is  to  find  a  concept  that  explains  the  labels  for  the 
bags  and  can  be  used  to  predict  labels  for  unseen  data.  It 
is  not  known  in  advance  which  instance  caused  the  bag  to 
be  labeled  as  positive.  If  this  were  known,  a  supervised 
learning  approach  could  be  used  instead. 
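The bag-labeling assumption can be sketched in a few lines. This is only an illustration with flat toy data; the `Bag` class, the helper name, and the threshold concept are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Bag:
    instances: List[object]   # feature vectors for flat data, graphs for relational data
    label: bool               # one label for the whole bag


def bag_label_from_concept(instances, contains_concept: Callable[[object], bool]) -> bool:
    """MI assumption: a bag is positive iff at least one instance contains the concept."""
    return any(contains_concept(x) for x in instances)


def is_target(x):
    # Hypothetical flat-data "concept": any value above 10.
    return x > 10


positive = Bag(instances=[1, 4, 12], label=bag_label_from_concept([1, 4, 12], is_target))
negative = Bag(instances=[2, 3], label=bag_label_from_concept([2, 3], is_target))
```

The learner only ever sees `instances` and `label`; which instance triggered a positive label stays hidden, which is what separates this setting from ordinary supervised learning.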

¹ For additional details on PROXIMITY, see http://kdl.cs.umass.edu.


We  present  an  approach  to  adapting  the  MIL  framework 
for  use  with  relational  data  where  bags  are  collections  of 
graphs.  The  instances  in  the  bag  can  be  either  explicitly 
enumerated  as  a  set  of  graphs  or  they  can  be  the  set  of 
(implicit) subgraphs of a single, larger graph. The structure of the relational data determines which representation is most appropriate. If the data consist of sets of disjoint graphs, such as the MUSK task where each conformation of a molecule could be represented as a separate graph, then it is better to explicitly enumerate the instances in each bag. If the data consist of a large connected database, such as IMDb, then a bag consisting of a single large graph can
be  easily  created  by  querying  the  database.  For  example, 
in  IMDb,  the  bags  for  the  movies  nominated  for  academy 
awards  can  be  created  by  querying  for  all  objects  connected 
to  a  nominated  movie  by  one  or  two  links. 
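Collecting everything within one or two links of a central object amounts to a bounded breadth-first search. The sketch below assumes the database has already been loaded into a plain adjacency dictionary; it does not use qGraph, and the function name and toy graph are hypothetical.

```python
from collections import deque


def neighborhood(adjacency, center, max_links=2):
    """Return the set of objects reachable from `center` within `max_links` links."""
    depth = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if depth[node] == max_links:
            continue  # do not expand past the depth limit
        for nxt in adjacency.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return set(depth)


# Hypothetical undirected toy graph stored as an adjacency dict.
adjacency = {
    "movie": ["actor", "studio"],
    "actor": ["movie", "award"],
    "studio": ["movie"],
    "award": ["actor", "ceremony"],
    "ceremony": ["award"],
}
bag_objects = neighborhood(adjacency, "movie", max_links=2)
```

Here the award is two links from the movie and is kept, while the ceremony is three links away and is left out of the bag.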

For the MI notation, we follow that of Maron (1998) and Maron and Lozano-Perez (1998). The set of positive bags is denoted $B^+$ and the $i$th positive bag is $B_i^+$. Likewise, the set of negative bags is denoted $B^-$ and the $i$th negative bag is $B_i^-$. If the discussion applies to both types of bags, we drop the superscript and refer to it as $B$. The $j$th instance of the $i$th bag is denoted $B_{ij}$. The target concept is denoted $c_t$, and other concepts as $c$. With flat data, a concept is a point in feature space. With relational data, a concept is an attributed graph.

3.  MI  learning  on  relational  data 

The data available to an MI learner is a set of positive and negative bags, $B^+$ and $B^-$. If the concept is a feature vector $v$, then each bag consists of a set of feature vectors: $B_i = \{v_{i1}, v_{i2}, \ldots, v_{ik}\}$. The most straightforward transformation to apply MIL to relational data is to have each instance represented as a separate graph. In this case, a bag would consist of a set of graphs: $B_i = \{G_{i1}, G_{i2}, \ldots, G_{ik}\}$. The goal is then to find a concept that can explain the labeling of the bags. The concept, $c$, is a subgraph of one of the graphs in $B^+$ and $B^-$. This representation is best suited for tasks where the data are already available as a set of disjoint graphs. The MUSK data set (Dietterich et al., 1997), image recognition tasks (Maron & Ratan, 1998), and the mutagenesis data set (Zucker & Chevaleyre, 2000) fit into this framework.

When the relational data are available as a large connected graph instead of a set of unconnected graphs, it may be easier to identify a single subgraph as containing something positive instead of enumerating every instance. For example, in the IMDb, we can hypothesize that there is some relational structure surrounding movies that could be used to predict whether a movie gets nominated for an academy award. Without knowing the structure in advance, it would be very difficult to create bags of every possible structure. However, it is relatively easy to identify the depth-two structure surrounding the movies and to use this to create bags where each bag has only one graph. The instances are assumed to be the set of all subgraphs of the single graph in the bag.


More formally, we propose to create the sets of bags $B^+$ and $B^-$ such that $B_i = \{G_i\}$, where $G_i$ is a single (large) graph. The instances of $B_i$ are assumed to be the set of all subgraphs of $G_i$. Since the size of this set is exponential in the size of $G_i$, where $|G_i|$ is defined as the sum of the number of vertices and edges in $G_i$, we do not explicitly enumerate the instances for each bag. Instead, the search methods take this assumption into account.

3.1. Relational diverse density

Several existing MI methods can be transformed to work with relational data. We adapt both diverse density (Maron, 1998; Maron & Lozano-Perez, 1998) and chi-squared (McGovern & Jensen, 2003). We first briefly review the definitions for diverse density. The most diversely dense concept is defined as that which is closest to the intersection of the positive bags and farthest from the union of the negative bags. More precisely, Maron defines the diverse density of a particular concept $c$ to be $DD(c) = P(c = c_t \mid B^+, B^-)$. We refer to $P(c = c_t)$ as $P(c)$ to simplify the equations. Using Bayes' rule and assuming independence, this can be reduced to finding the concept $c$ for which the likelihood $\prod_{1 \le i \le n} P(c \mid B_i^+) \prod_{1 \le i \le m} P(c \mid B_i^-)$ is maximal. The probability that concept $c$ is the target concept given the evidence available in the bag, $P(c \mid B_i)$, still needs to be determined. Maron discusses several ways to do this. In this work, we follow his suggestion of using a noisy-or model (Pearl, 1988), in which case we have:

$$P(c \mid B_i^+) = 1 - \prod_{1 \le j \le p} \bigl(1 - P(B_{ij}^+ \in c)\bigr) \tag{1}$$

$$P(c \mid B_i^-) = \prod_{1 \le j \le p} \bigl(1 - P(B_{ij}^- \in c)\bigr) \tag{2}$$

where $p$ is the number of instances in bag $B_i$ and $P(B_{ij} \in c)$ is the probability that the specified instance is in the concept.

Calculating $P(B_{ij} \in c)$ requires a specific form of target concept. In the case of flat data, Maron often used what he called the single-point concept, which is a point in feature space. With this concept, the calculation of $P(B_{ij} \in c)$ is based on the Euclidean distance between points $B_{ij}$ and $c$ in feature space. We need to define $P(B_{ij} \in c)$ when $B_{ij}$ and $c$ are both attributed graphs instead of points in feature space. To do this, we need a method for measuring the distance between two attributed graphs.

Metrics for measuring the distance between attributed graphs are not as well studied as metrics for flat data. We use the metric proposed by Bunke and Shearer (1998), which is based on finding the maximal common subgraph (MCS) between two graphs. They demonstrate that this distance measure satisfies the metric properties. The distance between two graphs $G_1$ and $G_2$ is defined as:

$$d(G_1, G_2) = 1 - \frac{|MCS(G_1, G_2)|}{\max(|G_1|, |G_2|)} \tag{3}$$

where $MCS(G_1, G_2)$ is the maximum common subgraph of $G_1$ and $G_2$. This metric was developed for unlabeled graphs but can be modified so that the MCS also uses the attributes to limit the number of matches. A disadvantage of this metric is that computing the MCS is computationally expensive: the problem is NP-hard and known exact algorithms take exponential time in the worst case. In the course of a thorough search in concept space, MCS is calculated frequently. We approximate the calculation by limiting the depth of the recursive search. Research on a principled polynomial-time distance metric for attributed graphs is a topic for future work. Based on this metric, we define $P(B_{ij} \in c)$ as:

$$P(B_{ij} \in c) = \frac{|MCS(B_{ij}, c)|}{\min(|B_{ij}|, |c|)} \tag{4}$$
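To make Equations 1, 2, and 4 concrete, here is a minimal sketch for very small graphs. It is not the depth-limited recursive search described in the text: `mcs_size` is a hypothetical brute-force stand-in that ignores attributes, treats graphs as `(nodes, edges)` pairs with directed edges, and counts a common subgraph as matched vertices plus shared edges. `graph_distance` assumes the Bunke and Shearer distance takes the form of one minus the normalized MCS size.

```python
from itertools import permutations


def graph_size(graph):
    nodes, edges = graph
    return len(nodes) + len(edges)  # |G| = vertices + edges


def mcs_size(g1, g2):
    """Brute-force |MCS| for tiny unlabeled directed graphs (factorial time)."""
    small, large = sorted((g1, g2), key=lambda g: len(g[0]))
    small_nodes = sorted(small[0])
    large_edges = set(large[1])
    best = 0
    # Try every injective mapping of the smaller vertex set into the larger one.
    for image in permutations(sorted(large[0]), len(small_nodes)):
        mapping = dict(zip(small_nodes, image))
        shared = sum((mapping[u], mapping[v]) in large_edges for u, v in small[1])
        best = max(best, len(small_nodes) + shared)
    return best


def graph_distance(g1, g2):
    """Assumed form of Equation 3: one minus |MCS| over the larger graph size."""
    return 1.0 - mcs_size(g1, g2) / max(graph_size(g1), graph_size(g2))


def instance_match(instance, concept):
    """Equation 4: |MCS| over the smaller graph size (graphs must be non-empty)."""
    return mcs_size(instance, concept) / min(graph_size(instance), graph_size(concept))


def prob_concept_given_positive_bag(bag, concept):
    """Equation 1: noisy-or over the instances of a positive bag."""
    miss = 1.0
    for instance in bag:
        miss *= 1.0 - instance_match(instance, concept)
    return 1.0 - miss


def prob_concept_given_negative_bag(bag, concept):
    """Equation 2: every instance of a negative bag must fail to match."""
    miss = 1.0
    for instance in bag:
        miss *= 1.0 - instance_match(instance, concept)
    return miss


triangle = ({"a", "b", "c"}, {("a", "b"), ("b", "c"), ("c", "a")})
path = ({1, 2, 3, 4}, {(1, 2), (2, 3)})
```

With these toy graphs the triangle and the path share three vertices and two edges, so the unlabeled match is already 5/6. This shows why using attributes to restrict matches matters in practice.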


Note that Equation 4 uses the same ratio as Equation 3, with the maximum replaced by a minimum. Since we are searching for the best subgraph, it is better to weight the match by the size of the proposed subgraph rather than by the size of the instances or of the bag, which could be arbitrarily large. If the instances in the bag are not enumerated, $P(c \mid B_i)$ becomes:

$$P(c \mid B_i^+) = \frac{|MCS(c, B_i^+)|}{\min(|c|, |B_i^+|)} \tag{5}$$

$$P(c \mid B_i^-) = 1 - \frac{|MCS(c, B_i^-)|}{\min(|c|, |B_i^-|)} \tag{6}$$

This means that the probability that $c$ is the correct concept given the evidence available in positive bag $B_i^+$ is the percent match of graph $c$ to graph $B_i^+$. Likewise, the probability that $c$ is the correct concept given the evidence in negative bag $B_i^-$ is one minus the percent match of graph $c$ to graph $B_i^-$. In other words, if $c$ matches highly with $B_i^+$, the probability that $c$ is correct will be high, but if it matches highly with $B_i^-$, the probability that $c$ is correct will be low.
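For single-graph bags, the diverse density likelihood reduces to a product of percent matches and their complements. The sketch below assumes the percent matches $|MCS(c, B_i)| / \min(|c|, |B_i|)$ have already been computed; the function names and the numbers are hypothetical.

```python
from math import prod


def prob_given_positive(match):
    """Equation 5: P(c | positive bag) is the percent match of c to the bag's graph."""
    return match


def prob_given_negative(match):
    """Equation 6: P(c | negative bag) is one minus the percent match."""
    return 1.0 - match


def diverse_density_likelihood(positive_matches, negative_matches):
    """Product of the per-bag probabilities over all positive and negative bags."""
    return (prod(prob_given_positive(m) for m in positive_matches)
            * prod(prob_given_negative(m) for m in negative_matches))


# Two candidate concepts with the same fit to the positive bags:
# one rarely matches the negative bags, the other matches them strongly.
good = diverse_density_likelihood([1.0, 0.9], [0.1, 0.2])
poor = diverse_density_likelihood([1.0, 0.9], [0.9, 0.8])
```

Because the likelihood is a product, one positive bag with a zero match or one negative bag with a full match drives the score to zero.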


3.2.  Relational  chi-squared  method 

In addition to diverse density, we present results using the chi-squared MI method (McGovern & Jensen, 2003). Chi-squared is simpler to calculate than diverse density and it allows for a more thorough search of the concept space because it provides a guaranteed pruning method. Chi-squared is calculated by filling in the contingency table shown in Table 1. The rows of the table correspond to the predicted label from the concept and the columns correspond to the actual labels for the training bags. Assuming a method for labeling the bags given a proposed target concept, the table is filled out in the following manner. If the concept predicts that the bag will be positive and it is positive, a is incremented. If the prediction is positive but the bag is really negative, b is incremented. If the prediction is negative and the bag is positive, c is incremented. Finally, d is incremented if the concept predicts negative and the bag is negative. Chi-squared is calculated by summing, over the cells of the contingency table, the squared difference between the observed and expected values divided by the expected value.

Table 1. Contingency table used by the chi-squared method. The cells are filled in using the predicted and known labels for the training bags using the proposed concept.

                         Actual bag label
                           +        -
Predicted bag label   +    a        b
                      -    c        d

Figure 2. Target concept for the illustrative data set

The best concept is defined as that with the highest chi-squared value. This will occur when the mass is concentrated on the main diagonal (i.e., in a and d), which means that the concept is predicting the most positive and the most negative bags correctly. More information about the chi-squared evaluation function for MIL can be found in (McGovern & Jensen, 2003).
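As an illustration of the evaluation function, the standard Pearson chi-squared statistic for a 2x2 contingency table can be computed as follows. This is a generic sketch that assumes no continuity correction and does not include the pruning method mentioned in the text; the function name is hypothetical.

```python
def chi_squared(a, b, c, d):
    """Pearson chi-squared for the 2x2 table [[a, b], [c, d]]."""
    observed = [[a, b], [c, d]]
    total = a + b + c + d
    if total == 0:
        return 0.0
    row_sums = [a + b, c + d]
    col_sums = [a + c, b + d]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            if expected > 0:  # an empty row or column contributes nothing
                stat += (observed[i][j] - expected) ** 2 / expected
    return stat


perfect = chi_squared(18, 0, 0, 18)   # all training bags predicted correctly
uninformative = chi_squared(9, 9, 9, 9)
```

The statistic by itself is symmetric: a table with all of its mass on the off-diagonal (b and c) scores as high as a perfect one, so a sketch like this would also need to check that the association points in the right direction.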


4.  Experimental  results:  illustrative  data  set 


We first present results using a small illustrative database where we both know the target answer in advance and can easily visualize the data. The objects and links each have one real-valued attribute associated with them. The target concept, shown in Figure 2, is a size-three clique with a particular set of attribute values on the objects and links.

We illustrate both chi-squared and diverse density using both data representations and this target concept. In both cases, graphs, including objects, links, direction of the links, and attributes, were generated randomly. To create a positive instance, a graph was randomly grown from the target clique. Negative instances were randomly grown from an empty graph. Attribute values from the target concept can be used in negative instances so long as the entire concept is not included. For the first data representation, both positive and negative graphs varied in size from three to ten objects with the same number of random links. Each positive bag had one positive instance and from two to six negative instances. Negative bags contained from three to seven negative instances. A sample positive instance and a sample negative instance are shown in Figure 3a. The bags for the second data representation, which contained only one instance per bag, varied in size from ten to twenty objects and had twice as many random links as there were objects. Example positive and negative bags for this framework are shown in Figure 3b. In both cases, we generated twenty positive bags and twenty negative bags.

For this experiment, we compared the relative prediction accuracies for the relational diverse density approach, the chi-squared technique, and the k-nearest neighbors (kNN) method. We repeat this comparison for both data representations. For the diverse density and chi-squared approaches, the MI learner identified the best concept (or set of concepts) for predicting the bag labels. Given a relational concept $c$ and a bag $B_i$ with an unknown label, the predicted real-valued label is:

$$\text{label} = \max_{1 \le j \le k} P(B_{ij} \in c) = \max_{1 \le j \le k} \frac{|MCS(B_{ij}, c)|}{\min(|B_{ij}|, |c|)}$$

If there is only one graph in the bag, this becomes:

$$\text{label} = \frac{|MCS(B_i, c)|}{\min(|B_i|, |c|)}$$


Under this formulation, the predicted label for the bag will be a real number in the interval [0, 1]. A prediction of zero means the bag should be labeled as negative and a prediction of one means that the bag should be labeled as positive. Values strictly between zero and one are also possible, and we examine the best choice of thresholds through the use of an ROC curve that measures the ratio of true positives to false positives as the threshold varies from zero to one.
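The prediction rule and the threshold sweep behind an ROC curve can be sketched as follows. The per-instance match values are assumed to be precomputed, the choice of "score at or above the threshold means positive" is an assumption, and all names and numbers are hypothetical.

```python
def predict_bag_label(instance_matches):
    """Real-valued bag label: the best percent match of any instance to the concept."""
    return max(instance_matches)


def roc_points(scores, labels, thresholds):
    """Return (false positive rate, true positive rate) for each threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if not y and s >= t)
        points.append((fp / negatives, tp / positives))
    return points


# Four toy bags, each given as the percent matches of its instances.
bags = [[0.2, 1.0], [0.6, 0.4], [0.5, 0.3], [0.1, 0.2]]
labels = [True, True, False, False]
scores = [predict_bag_label(m) for m in bags]
curve = roc_points(scores, labels, thresholds=[0.0, 0.5, 0.55, 1.0])
```

In this toy case a threshold of 0.55 separates the bags perfectly, while 0.5 admits one false positive and 1.0 misses one positive bag.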

We used kNN as a baseline for comparison. We identify the k nearest neighbors using the distance metric specified in Equation 3. Because the true labels for the individual instances are unknown, multiple instances in a bag are all assumed to have the same label as the bag. If the instances are not individually enumerated, we assign the label to the graph representing the bag itself and use this larger graph for the kNN calculations. We modify the prediction mechanism of kNN in the following manner. For each instance in an unlabeled bag, we determine the ratio of positive instances in the k nearest instances. The most extreme of
these ratios weighted by the number of different positive or negative bags that contributed to the ratio is chosen and the ratio itself (without the weighting) is output as the label. The idea of weighting the ratio this way is related to diverse density and helps to make kNN a higher performing baseline for comparison.

Figure 4 shows the ROC curves for relational diverse density, chi-squared, and two values of k for kNN for the first data representation, where there are multiple enumerated instances per bag. These numbers are averaged over 10 folds of cross-validation. The test set for each fold was 2 positive bags and 2 negative bags and the training set was the remaining 18 positive bags and 18 negative bags. The chi-squared method identified the correct target concept each time and had perfect prediction for this task. We do not claim that the chi-squared method will always have perfect performance, but it was able to quickly find the target concept for this task. The relational diverse density approach sometimes found a subset of the true concept, which gave it a small false positive rate depending on the threshold chosen to differentiate between positive and negative predictions. The two kNN approaches shown had higher accuracy than diverse density for very high thresholds but quickly degraded in performance, while diverse density was more robust to threshold changes. At a threshold of 0.5, the accuracies were: chi-squared = 100%, diverse density = 85%, and kNN = 80% and 70% for k = 4 and k = 20. Accuracy is the percent of bag labels in the test set that are predicted correctly.

Figure 5 shows the ROC curve for the same three methods in the case where each bag had only one large graph in it. With a threshold of 0.5, chi-squared had 100% accuracy, diverse density had 92.5% accuracy, and kNN had 70% and 50% accuracies for k = 3 and k = 10. These results are comparable with those presented above and demonstrate that both data representations can be used successfully for MIL on relational data. Our next experiments focus on a much larger database.

Figure 3. A: Example instances for the three-clique task for the representation where each instance is enumerated. The target concept is shown with dashes. B: Example bags for the three-clique task for the representation where each bag consists of a single large graph.

Figure 4. ROC curve comparing the performance of chi-squared, diverse density, and kNN on the illustrative data set. In this case, each bag had an enumerated set of instances.

Figure 5. ROC curves for the illustrative data set where each bag had one large graph.
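The evaluation protocol for the illustrative data set (10 folds, each holding out 2 positive and 2 negative bags out of 20 of each, with accuracy measured at a threshold of 0.5) can be sketched as below. The splitting scheme, the seed, and the function names are assumptions; the text does not say how bags were assigned to folds.

```python
import random


def stratified_folds(positive_bags, negative_bags, n_folds=10, seed=0):
    """Yield (train, test) lists of (bag, label) pairs with balanced test sets."""
    rng = random.Random(seed)
    pos, neg = list(positive_bags), list(negative_bags)
    rng.shuffle(pos)
    rng.shuffle(neg)
    for k in range(n_folds):
        # Every n_folds-th bag, starting at offset k, goes into the test set.
        test = ([(b, True) for b in pos[k::n_folds]]
                + [(b, False) for b in neg[k::n_folds]])
        train = ([(b, True) for i, b in enumerate(pos) if i % n_folds != k]
                 + [(b, False) for i, b in enumerate(neg) if i % n_folds != k])
        yield train, test


def accuracy(scores, labels, threshold=0.5):
    """Fraction of bags whose thresholded score agrees with the known label."""
    return sum((s >= threshold) == y for s, y in zip(scores, labels)) / len(labels)


# Integers stand in for the 20 positive and 20 negative bags.
folds = list(stratified_folds(range(20), range(100, 120)))
```

Each fold then has 36 training bags and 4 test bags, matching the 18/18 and 2/2 split described for Figure 4.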


5.  Experimental  results:  IMDb 

The IMDb is a much larger database, with one million objects and nearly five million links, where the ability to identify predictive structures should help us to better understand the nature of the database. The two tasks that we present are: predicting which movies will be nominated for academy awards and predicting which movies will gross at least two million dollars (adjusted for inflation) during opening weekend. Both of these tasks are very difficult, and if there were a perfect predictor of movie success, then studio executives would have identified it already. Also, both tasks rely on an unknown number of factors which may not all be in the database (e.g., Hollywood politics are not included in IMDb). However, the difficulty of the tasks provides a good challenge for our approach.

Query constraints:

movie2.year, release.year, Actor-award.year < movie.year
Director-award.year, Producer-award.year < movie.year
1970 < movie.year < 2000

Figure 6. qGraph query used to identify high-grossing movies and to create the positive bags. Dashed circles indicate the query restriction and number ranges indicate the minimum number of objects required for a match.

5.1.  High-grossing  movies 

The IMDb is a large connected database and thus corresponds to the second data representation where each bag contains only one instance. We created the bags by querying the database using the qGraph query shown in Figure 6. This query is the depth-two structure surrounding high-grossing movies with the exception that we do not follow links from studios. Studios typically make hundreds of movies and following those links would lead to unnecessarily large graphs. This query returns a set of subgraphs from the database that match the specified structure. In particular, each subgraph will contain a central movie object and its related release objects where at least one release grossed more than 2 million dollars on opening weekend. In addition, any associated studios, genres, producers, directors, actors/actresses, and related movies will be included in each subgraph. If any of the producers, directors, actors/actresses, or related movies have award objects linked to them, these will also be included. Finally, the graph is pruned to remove any events that occurred after the movie's release. This is necessary because we want the structures that the MI learner identifies to predict forward in time. To help minimize noise and the size of the data, we further restrict the set to only contain movies from 1970 to 2000. We randomly sampled this set to obtain approximately 200 positive instances. We reused the same query structure to generate the negative bags except that the releases on opening weekend were restricted to gross less than 2 million dollars. There are a considerable number of such movies, so we randomly subsampled to obtain approximately 200 negative bags.

Figure 7. ROC curves comparing the false positive and true positive ratios for the chi-squared MI technique and kNN on the task of predicting high-grossing movies.
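The temporal pruning step can be sketched as a filter over a bag's objects. The sketch assumes each dated object carries a `year` attribute and that objects without a date (such as genres) are kept; the function name, the attribute name, and the toy data are hypothetical, and the strict inequality follows the query constraints listed with Figure 6.

```python
def prune_future_events(objects, links, center):
    """Drop objects dated in or after the central movie's year, plus their links.

    `objects` maps object id -> attribute dict (optionally with a "year").
    `links` is an iterable of (source id, target id) pairs.
    """
    movie_year = objects[center]["year"]
    kept = {
        oid: attrs for oid, attrs in objects.items()
        if oid == center or attrs.get("year") is None or attrs["year"] < movie_year
    }
    # Keep only links whose endpoints both survived the pruning.
    kept_links = [(s, t) for s, t in links if s in kept and t in kept]
    return kept, kept_links


objects = {
    "movie": {"year": 1995},
    "actor_award": {"year": 1990},
    "later_award": {"year": 1998},
    "genre": {},
}
links = [("movie", "actor_award"), ("movie", "later_award"), ("movie", "genre")]
kept, kept_links = prune_future_events(objects, links, center="movie")
```

Pruning this way keeps information from after the movie's release out of the learned structures, which is what lets them predict forward in time.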

We again ran 10-fold cross-validation and obtained predictions for the unseen positive and negative bags from the top five percent of the concepts identified by MIL, where the concepts are ranked by their chi-squared values. The inability to prune with diverse density hinders its use on such a large data set, so we used only the chi-squared approach.

Figure 7 shows the results of this experiment using ROC curves. The chi-squared method was able to detect several substructures that predicted high-grossing movies. The results shown in this graph are for the most highly ranked concept on each of the 10 folds, labeled chi-squared TOP, and for the top 5% of the concepts, labeled chi-squared OR. In the latter case, each concept outputs a separate prediction and we used the OR, or max, of these predictions. Although MIL has slightly lower performance in the region of the ROC curve with higher true positives but also higher false positives, its performance is better than kNN in the region with lower false positives and higher true positives. Also, its performance only degrades as the threshold is dropped almost to zero, while kNN is less robust to the threshold value. With a threshold of 0.5, chi-squared TOP achieves an accuracy of 69.2% and chi-squared OR has a 70.1% accuracy. kNN's accuracies are 61% and 53.6% for k = 1 and k = 10. With this prediction mechanism, studios could better allocate money to movies. As we said in the beginning of this experiment, predicting high-grossing movies is a difficult task and it is unlikely that any learning agent could achieve high accuracy.
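The two ways of turning ranked concepts into a bag prediction, chi-squared TOP and chi-squared OR, can be sketched as follows. The `match` function stands in for the percent match between a bag and a concept; here it is a hypothetical set-overlap toy rather than an MCS computation, and all names and scores are made up for illustration.

```python
def top_percent(concepts_with_scores, percent=5):
    """Return the highest-scoring concepts (at least one), ranked by chi-squared value."""
    ranked = sorted(concepts_with_scores, key=lambda pair: pair[1], reverse=True)
    keep = max(1, len(ranked) * percent // 100)
    return [concept for concept, _ in ranked[:keep]]


def predict_top(bag, concepts, match):
    """TOP: use only the single highest-ranked concept."""
    return match(bag, concepts[0])


def predict_or(bag, concepts, match):
    """OR: take the max of the per-concept predictions."""
    return max(match(bag, c) for c in concepts)


def match(bag, concept):
    # Toy stand-in: fraction of the concept's tokens present in the bag.
    return len(bag & concept) / len(concept)


scored = [(frozenset({"sequel"}), 30.0), (frozenset({"best-picture"}), 25.0)]
scored += [(frozenset({f"x{i}"}), i * 0.1) for i in range(38)]  # low-scoring filler
best = top_percent(scored, percent=5)

bag = {"best-picture", "drama"}
top_score = predict_top(bag, best, match)
or_score = predict_or(bag, best, match)
```

In this toy case the bag misses the top-ranked concept but matches the second one, so OR labels it positive where TOP does not. That trade-off (more true positives, possibly more false positives) is the kind of difference the two curves in Figure 7 compare.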

One of the other benefits of using MIL on this database, besides prediction accuracy, is that the answers are in the form of understandable structures. Figure 8 shows some of the top structures for predicting high-grossing movies. It seems that movies are more likely to be high-grossing if they are related to two or three other movies (e.g., a movie in a series like Star Trek or Indiana Jones or movies that remade previous successful movies). Another predictive structure that the chi-squared MI learner identified was a movie related to another movie that was both nominated and awarded an academy award for best picture. Last, just the presence of a best picture award object in the subgraph was predictive of movie success.

Figure 8. Top predictive relational structures identified by the MI learner on the high-grossing movie task.

5.2.  Movies  nominated  for  academy  awards 

We  repeated  the  same  experiments  for  the  difficult  task  of 
predicting  which  movies  will  be  nominated  for  academy 
awards  each  year.  The  query  used  to  generate  the  positive 
bags  is  shown  in  Figure  9.  The  structure  of  this  query  is 
identical  to  that  discussed  for  high-grossing  movies  except 
that we require an academy award nomination. The positive
bags do not actually contain the award objects for the
central movie because we want MIL to identify predictive
structures. This query yields 72 positive bags. We use the
same query minus the requirement for the awards to create
the negative bags. The number of movies which are
not  nominated  for  academy  awards  is  quite  large  and  we 
randomly  sample  this  set  to  obtain  approximately  the  same 
number  of  negative  bags  (74). 
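
The downsampling of the negative set can be sketched as below. This is only an illustration under stated assumptions: the pool of candidate identifiers, its size, the seed, and the function name are hypothetical, and the text does not say how the random sample was actually drawn.

```python
import random


def sample_negative_bags(candidate_ids, n_samples, seed=0):
    """Draw n_samples distinct negative bags uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(candidate_ids, n_samples)


# Hypothetical pool of identifiers for movies that were not nominated.
candidates = list(range(5000))
negatives = sample_negative_bags(candidates, 74)
```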

We  again  compare  the  predictive  ability  of  chi-squared  to 
kNN  on  10-fold  cross  validation  with  this  data  set.  These 
results  are  shown  in  Figure  10,  again  using  ROC  curves.  In 
this  case,  the  structures  found  by  the  MI  learner  dominate 
any of the predictions from kNN for all values of k (we
show two of the best values of k in the figure). Assuming
a threshold of 0.5, the accuracy of chi-squared TOP is 93%
and the accuracy of chi-squared OR is 77%. kNN has
accuracies of 49.7% and 50.7% for k = 5 and k = 10.
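
For readers unfamiliar with ROC curves, the sketch below shows how a curve of this kind can be traced by sweeping the decision threshold over the predicted scores. It is a generic illustration with hypothetical names and toy data, not the evaluation code used for the figures.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]  # threshold above every score: nothing is predicted positive
    # Lower the threshold step by step; each step can only add predicted positives.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


# Toy data: two positive bags and two negative bags.
toy_scores = [0.9, 0.8, 0.4, 0.1]
toy_labels = [1, 0, 1, 0]
curve = roc_points(toy_scores, toy_labels)
```

A method dominates another on such a plot when its curve lies on or above the other's at every false positive rate.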

We  also  examine  the  relational  structures  that  the  MI 
learner  identified  as  predictive  of  whether  a  movie  will  be 
nominated for an academy award. Some of the top structures
are shown in Figure 11. For this task, it seems that
movies  with  at  least  20  actors  in  them  are  more  likely  to  be 
nominated for academy awards. This is surprising and is
likely due to a reverse effect: better movies have more
information in IMDb, which means they tend to have more
actors associated with them. A related structure has the
same form but restricts the genre to drama. These structures
may not help a studio executive to better allocate money to
new movies but they did identify an important characteristic
of the database, which is one of the goals of this work.
The presence of only a drama object is enough to predict
the nomination in many cases. Last, if a previous award object
existed in the subgraph, e.g., if the movie was related
to a movie that was also nominated for or won an academy
award, it was likely to be nominated itself.

Query constraints:

movie2.year, release-year, Actor-award.year < movie.year
Director-award.year, Producer-award.year < movie.year
1970 < movie.year < 2000

Figure 9. qGraph query used to create the positive bags for
the task of predicting which movies will be nominated for an
academy award.

Figure 10. ROC curves for the task of predicting which movies
will be nominated for academy awards.

6.  Discussion  and  Conclusions 

In this paper, we have presented an approach to identifying
predictive structures in relational databases based on
the MIL framework. We adapted this framework for use
with relational data in two related ways: one where the
bags had multiple independent graphs as the instances and
one where the bags had one larger graph and the instances
were the (implicit) subgraphs of this graph. We demonstrated
that these adaptations could be used to modify existing
MI methods and that the relational versions of these
methods could be used successfully on both a small and a
large database.

Figure 11. Relational structures identified by the MI learner for
predicting academy award nominees.

One of the strengths of MIL that is emphasized for flat data
is the ability to identify which features of the task are important.
In the diverse density framework, this is referred
to as scaling. When the concept is a feature vector, diverse
density can identify a scale for each feature that maximizes
the diverse density value. If a feature is irrelevant,
its best scale will be zero. This strength also applies to
the techniques that we presented in this paper. Instead of
scaling features in a vector, the concepts identified by the
relational MI learner will only contain a subset of the objects
and links from the bags. This subset represents the
more  relevant  features  with  respect  to  the  current  task. 
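
The effect of a zero scale in the flat (feature vector) setting can be illustrated with the scaled-distance form used in diverse density (Maron & Lozano-Perez, 1998), where the probability that an instance matches a concept is exp(-sum_k s_k^2 (x_k - c_k)^2). The sketch below is a minimal illustration; the function name and the example vectors are hypothetical.

```python
import math


def instance_match_prob(instance, concept, scales):
    """Probability that an instance matches a concept under per-feature scaling.

    Each squared feature difference is weighted by the square of its scale,
    so a feature with scale 0 has no influence on the result.
    """
    scaled_sq_dist = sum(
        (s * (x - c)) ** 2 for x, c, s in zip(instance, concept, scales)
    )
    return math.exp(-scaled_sq_dist)


concept = [1.0, 0.0]
scales = [1.0, 0.0]  # the second feature is treated as irrelevant

# The two instances differ only in the irrelevant feature.
p_near = instance_match_prob([1.0, 0.0], concept, scales)
p_far = instance_match_prob([1.0, 5.0], concept, scales)
```

Both instances receive the same probability because the second feature is scaled to zero, which is the flat-data analogue of a relational concept that simply omits irrelevant objects and links.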

Another advantage of MI techniques is that they identify
an actual concept (or set of concepts) that can be understood
by a human. kNN can be used to label new data
but it cannot identify aspects of the data that can help a
human to better understand the database. With such structures,
a human can iteratively refine their understanding of
the  database  and  of  the  tasks  at  hand. 

Relational  probability  trees  (RPT)  (Neville  et  al.,  2003)  are 
a  related  approach  in  that  they  have  also  been  developed  to 
identify  predictive  structure  in  large  relational  databases. 
However, MIL and RPTs express different relational concepts.
RPTs are designed to identify structure in a tree form
using  attributes  on  objects  or  links  or  structure  such  as  the 
number  of  outgoing  links  from  an  object.  Although  this  can 
work  very  well  on  tasks  such  as  predicting  high-grossing 
movies,  RPTs  cannot  represent  graph  concepts  such  as  the 
3-clique  presented  in  Section  4. 
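
To make the notion of a graph concept concrete, the sketch below tests whether an undirected graph contains a 3-clique, i.e., three objects that are all pairwise linked. It is a simplified illustration: it ignores the object and link types that a relational concept would also constrain, and the function name and example graphs are hypothetical.

```python
from itertools import combinations


def has_3_clique(edges):
    """Return True if the undirected graph given as (a, b) links contains a 3-clique."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    # Check every triple of distinct objects for three mutual links.
    return any(
        b in neighbors[a] and c in neighbors[a] and c in neighbors[b]
        for a, b, c in combinations(neighbors, 3)
    )


triangle = [("x", "y"), ("y", "z"), ("x", "z"), ("z", "w")]
path = [("x", "y"), ("y", "z"), ("z", "w")]
found_in_triangle = has_3_clique(triangle)
found_in_path = has_3_clique(path)
```

A concept like this depends on a cycle of links among three objects, which is the kind of structure the text says a tree-shaped model cannot represent.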